This is a built-in R data set from Fisher and Anderson. It contains four measurements (columns 1-4) for 50 samples from each of three iris species (column 5). Let’s run PCA on the data and see if we can view it in a 2D plot:
data('iris')
head(iris) # see what it looks like
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
iris.x = iris[,1:4] # measurements
iris.y = iris[,5] # our response labels
iris.pca = prcomp(iris.x, scale.=TRUE, center=TRUE) # standardize each measurement before PCA
summary(iris.pca)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
plot(iris.pca, main="Variances of PCs 1-4", xlab="PCs 1-4") # scree plot: bar heights are variances, not proportions
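The proportions reported by `summary()` can be recomputed by hand from the standard deviations stored in the `prcomp` object. This is a minimal sketch; the names `pc.var`, `prop.var`, and `cum.var` are our own and not part of the original walkthrough.

```r
data("iris")
iris.pca = prcomp(iris[, 1:4], scale. = TRUE, center = TRUE)

# The variance of each PC is its squared standard deviation
pc.var = iris.pca$sdev^2

# Proportion of total variance explained by each PC
prop.var = pc.var / sum(pc.var)

# Running total, matching the "Cumulative Proportion" row of summary()
cum.var = cumsum(prop.var)

round(rbind(Proportion = prop.var, Cumulative = cum.var), 4)
```

The second entry of `cum.var` is the share of variance captured by PC1 and PC2 together.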
It looks like PC1 and PC2 together account for more than 95% of the variance. Let’s use these to make our plot.
plot(iris.pca$x[,1], iris.pca$x[,2], col=iris.y, main="2D PCA Plot of Iris Data",xlab="PC1 (Variance 73%)",ylab="PC2 (Variance 23%)")
We can see the three species cluster above, but wouldn’t it be nice to hover over a point and learn more about it? We can do this with Plotly!
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# We'll add the 3rd PC to the data frame
iris.pca.df = data.frame(PC1=iris.pca$x[,1], PC2=iris.pca$x[,2], PC3=iris.pca$x[,3], Species=iris.y)
# YOUR CODE HERE!
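One possible way to fill in the placeholder is sketched below. It assumes the plotly package is installed and rebuilds the PCA scores so that it runs on its own. The `scores` data frame, the `label` column, and the marker size are illustrative choices, not part of the original exercise.

```r
library(plotly)

# Rebuild the PCA scores so this block is self-contained
data("iris")
iris.pca = prcomp(iris[, 1:4], scale. = TRUE, center = TRUE)
scores = data.frame(iris.pca$x[, 1:3], Species = iris$Species)

# Hover text: row number of the sample plus its species (our own labelling)
scores$label = paste0("Sample ", seq_len(nrow(scores)), " (", scores$Species, ")")

# 3D scatter plot of the first three PCs, colored by species
p = plot_ly(scores, x = ~PC1, y = ~PC2, z = ~PC3,
            color = ~Species, text = ~label, hoverinfo = "text",
            type = "scatter3d", mode = "markers",
            marker = list(size = 4))
p = layout(p, title = "3D PCA Plot of Iris Data")
p
```

Hovering over a point should then show its label. Dropping `z` and using `type = "scatter"` would give an interactive version of the 2D plot instead.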